Inflating Training Data for Statistical Machine Translation using Unaligned Monolingual Data
Authors
Abstract
In data-driven machine translation, parallel corpora are an extremely important resource. For language pairs that involve English, many bilingual or multilingual parallel corpora are freely available, especially for European languages. To improve translation quality for less-resourced language pairs, such as Chinese–Japanese, increasingly large amounts of aligned training data are needed. Building large bilingual corpora is difficult for less-documented language pairs. In this paper, we show how to construct a Chinese–Japanese quasi-parallel corpus automatically by using analogical associations based on a small number of parallel sentences and a reasonable amount of monolingual data. We perform SMT experiments in Chinese–Japanese and compare a baseline system with a system built by adding the quasi-parallel corpus. On the same test set, translation quality improved significantly over the baseline system.
Similar resources
Forms Wanted: Training SMT on Monolingual Data
We propose and evaluate a simple technique of “reverse self-training” for statistical machine translation. The technique makes it possible to extend the target-side vocabulary of the MT system using target-side monolingual data, and it is aimed especially at translation into morphologically rich languages.
Improving Neural Machine Translation Models with Monolingual Data
Neural Machine Translation (NMT) has obtained state-of-the-art performance for several language pairs, while only using parallel data for training. Monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation, and we investigate the use of monolingual data for NMT. In contrast to previous work, which integrates a sepa...
Improved Statistical Machine Translation Using Monolingual Paraphrases
We propose a novel monolingual sentence paraphrasing method for augmenting the training data for statistical machine translation systems “for free” – by creating it from data that is already available rather than having to create more aligned data. Starting with a syntactic tree, we recursively generate new sentence variants where noun compounds are paraphrased using suitable prepositions, and ...
Improving word alignment for low resource languages using English monolingual SRL
We introduce a new statistical machine translation approach specifically geared to learning translation from low resource languages, that exploits monolingual English semantic parsing to bias inversion transduction grammar (ITG) induction. We show that in contrast to conventional statistical machine translation (SMT) training methods, which rely heavily on phrase memorization, our approach focu...
Cross-lingual spoken language understanding from unaligned data using discriminative classification models and machine translation
This paper investigates several approaches to bootstrapping a new spoken language understanding (SLU) component in a target language given a large dataset of semantically-annotated utterances in some other source language. The aim is to reduce the cost associated with porting a spoken dialogue system from one language to another by minimising the amount of data required in the target language. ...
Publication date: 2015